The Evolution of Large Language Model Architectures: From BERT to GPT and T5

The Three Transformer Architectures

The development of large language models is marked by a paradigm shift: a move from task-specific models to general-purpose pre-training, where a single architecture can be adapted to many natural language processing (NLP) tasks.

At the center of this shift is the Self-Attention mechanism, which lets the model weigh the importance of different words in a sequence:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
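
To make the formula concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is for illustration only: the function names and the toy 2x2 inputs are assumptions, and real implementations use batched tensor libraries and multiple heads rather than nested lists.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
    return output


# Toy inputs (hypothetical): two tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each query here matches one key more strongly than the other, so each output row is pulled toward the corresponding value vector without copying it exactly.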

1. Encoder-Only (BERT)

Mechanism: Masked Language Modeling (MLM).
Behavior: Bidirectional context; the model "sees" the whole sentence at once to predict the hidden words.
Best suited for: Natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).
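
The masked-prediction setup described above can be sketched as follows. This is a simplified illustration: the helper names are hypothetical and the masked position is chosen by hand, whereas BERT's actual pre-training samples positions at random and does not always substitute the mask token.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Return (masked_input, labels) for the chosen positions."""
    hidden = set(positions)
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    # Labels are None wherever no prediction is required.
    labels = [tok if i in hidden else None for i, tok in enumerate(tokens)]
    return masked, labels


def visible_context(masked, position):
    """Tokens an encoder can use to predict `position`: both sides of it."""
    return masked[:position], masked[position + 1:]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, positions=[2])
left, right = visible_context(masked, 2)
```

The point of the sketch is `visible_context`: the prediction for the hidden word can draw on tokens to its left and to its right, which is what "bidirectional" means here.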

2. Decoder-Only (GPT)

Mechanism: Autoregressive modeling.
Behavior: Left-to-right processing; predicts the next token based solely on the preceding context (causal masking).
Best suited for: Natural language generation (NLG) and creative writing. This is the foundation of modern large language models such as GPT-4 and Llama 3.
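
As a rough illustration of causal masking, the sketch below builds the lower-triangular mask that keeps each position from attending to later ones. The function names are hypothetical, and real models apply this mask to the attention scores inside each layer rather than to token lists.

```python
def causal_mask(n):
    """n x n matrix where entry [i][j] is True if position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_prefix(tokens, position):
    """Context available when predicting the token at `position`."""
    return tokens[:position]


tokens = ["the", "cat", "sat", "on"]
mask = causal_mask(len(tokens))
context_for_sat = visible_prefix(tokens, 2)
```

Compared with the encoder case, nothing to the right of the current position is ever visible, which is what allows the same model to generate text one token at a time.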

3. Encoder-Decoder (T5)

Mechanism: Text-to-Text Transfer Transformer.
Behavior: An encoder turns the input sequence into a dense representation, then a decoder generates the target sequence.
Best suited for: Translation, summarization, and other sequence-to-sequence tasks.
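
One way to picture the text-to-text framing is as a formatting step: every task becomes an input string with a task prefix, and the answer is always an output string. The sketch below assumes a small hypothetical prefix table in the style T5 uses; the dictionary keys and the helper name are illustrative, not part of any library.

```python
# Hypothetical prefix table in the style of T5's task prefixes.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task, text):
    """Turn a (task, input) pair into a single string for the encoder."""
    if task not in PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return PREFIXES[task] + text


summarize_input = to_text_to_text("summarize", "A long article.")
translate_input = to_text_to_text("translate_en_de", "The house is small.")
```

Because both tasks share one input and output format, the same encoder-decoder weights and the same training objective can serve all of them.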

Key Insight: The Dominance of Decoders

The industry has largely converged on Decoder-Only architectures because of their favorable scaling laws and their emergent reasoning abilities in zero-shot settings, where the model is given no task-specific training examples.

Impact of the Context Window on VRAM

In Decoder-Only models, the KV cache grows linearly with sequence length. A 100k context window needs considerably more VRAM than an 8k window, which makes running long-context models locally difficult without quantization.
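
A back-of-the-envelope estimate shows the linear growth. The sketch below assumes a hypothetical model with 32 layers, 32 key/value heads, a head dimension of 128, and 16-bit cache values; it treats "8k" as 8,192 tokens and "100k" as 100,000 tokens. None of these numbers come from a specific model, and architectures that share key/value heads across query heads would need less.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes for one forward context."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical configuration, chosen only for illustration.
CONFIG = dict(n_layers=32, n_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

cache_8k = kv_cache_bytes(8_192, **CONFIG)
cache_100k = kv_cache_bytes(100_000, **CONFIG)
# Storing cache values in 8 bits instead of 16 halves the footprint.
cache_100k_int8 = kv_cache_bytes(100_000, bytes_per_value=1, **CONFIG)

print(f"8k context:   {cache_8k / GIB:.1f} GiB")
print(f"100k context: {cache_100k / GIB:.1f} GiB")
print(f"100k at 8-bit: {cache_100k_int8 / GIB:.1f} GiB")
```

Under these assumptions the cache alone goes from 4 GiB at 8k tokens to roughly 49 GiB at 100k tokens, before counting the model weights, which is why reduced-precision storage matters for local deployment.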

Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

Decoders scale more effectively for generative tasks and instruction following via next-token prediction.

Encoders cannot process text bidirectionally.

Decoders require less training data for classification tasks.

Encoders are incompatible with the Self-Attention mechanism.

Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

Encoder-Only (BERT)

Decoder-Only (GPT)

Encoder-Decoder (T5)

Recurrent Neural Networks (RNN)

Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.

Step 1

Identify the architectural bottleneck regarding context processing.

Solution:
Encoder-Decoder models must run the entire long input through the encoder and then perform cross-attention in the decoder at every generation step, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process the prompt and the generated output uniformly, as one sequence in a single stack. With modern techniques such as FlashAttention and KV cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.

Step 2

Justify the preference using Scaling Laws.

Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.